{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "13cb272e",
   "metadata": {},
   "source": [
    "# Code documentation Q&A bot example with LangChain\n",
    "![picture](https://lancedb.github.io/lancedb/assets/ecosystem-illustration.png)\n",
    "\n",
    "This Q&A bot will allow you to query your own documentation easily using questions. We'll also demonstrate the use of LangChain and LanceDB using the OpenAI API.\n",
    "\n",
    "In this example we'll **Numpy 1.26** documentation, but, this could be replaced for your own docs as well"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a0e829a",
   "metadata": {
    "id": "wgPbKbpumkhH"
   },
   "source": [
    "### Credentials\n",
    "\n",
    "Copy and paste the project name and the api key from your project page.\n",
    "These will be used later to [connect to LanceDB Cloud](#scroll-to=5q8m6GMD7sGu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6553603f",
   "metadata": {
    "id": "rqEXT5-fmofw"
   },
   "outputs": [],
   "source": [
    "project_slug = \"your-project-slug\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "36ef9c45",
   "metadata": {
    "id": "5LYmBomPmswi"
   },
   "outputs": [],
   "source": [
    "api_key = \"sk_...\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33ba6af1",
   "metadata": {
    "id": "Xs6tr6CMnBrr"
   },
   "source": [
    "You can also set the LANCEDB_API_KEY as an environment variable. More details can be found <a href=\"https://github.com/lancedb/vectordb-recipes/tree/main/examples/RAG_Reranking/lancedb_cloud/README.md\">**here**</a>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Le27BWs2vDbB"
   },
   "source": [
    "Since we will be using OPENAI API, let us set the OPENAI API KEY as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-2-fyVPKu9fl"
   },
   "outputs": [],
   "source": [
    "openai_api_key = \"sk-...\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1991331f-4316-417a-b693-e2f27cbe9ea7",
   "metadata": {},
   "source": [
    "### Installing dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8a49c31",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install -U langchain langchain-openai langchain-community \"httpx<0.28\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66638d6c",
   "metadata": {
    "id": "QR9W53zStdlz"
   },
   "outputs": [],
   "source": [
    "! pip install -qq tiktoken unstructured pandas lancedb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0QQL4lm8lTzg"
   },
   "source": [
    "### Importing libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vP6d6JUShgqo"
   },
   "outputs": [],
   "source": [
    "import openai\n",
    "import os\n",
    "import re\n",
    "import pickle\n",
    "import requests\n",
    "import zipfile\n",
    "from pathlib import Path\n",
    "\n",
    "from langchain.document_loaders import UnstructuredHTMLLoader\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.vectorstores import LanceDB\n",
    "from langchain_openai import OpenAI, OpenAIEmbeddings\n",
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
    "assert openai.models.list() is not None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8eKRYd2F7v5n"
   },
   "source": [
    "### Get the data\n",
    "To make this easier, we've downloaded Numpy documentation and stored the raw HTML files for you to download. Once the docs are downloaded, we then use LangChain's HTML document readers to parse them and store them in LanceDB as a vector store, along with relevant metadata.\n",
    "By default we use numpy docs, but you can replace this with your own docs as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "l0ezDr7suAf_"
   },
   "outputs": [],
   "source": [
    "numpy_docs = requests.get(\"https://numpy.org/doc/1.26/numpy-html.zip\")\n",
    "with open(\"numpy-html.zip\", \"wb\") as f:\n",
    "    f.write(numpy_docs.content)\n",
    "\n",
    "file = zipfile.ZipFile(\"numpy-html.zip\")\n",
    "file = file.extractall(path=\"numpy_docs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HJf8xZmX8VJC"
   },
   "source": [
    "We'll create a simple **helper function** that can help to extract metadata, so it can used later when querying with filters. In this case, we want to keep the lineage of the uri or path for each document that has been processed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5aljyqpUiViE"
   },
   "outputs": [],
   "source": [
    "# Pre-processing and loading the documentation\n",
    "\n",
    "# Next, let's pre-process and load the documentation. To make sure we don't need to do this repeatedly if we were updating code,\n",
    "# we're caching it using pickle so we can retrieve it again (this could take a few minutes to run the first time you do it).\n",
    "# We'll also add some more metadata to the docs here such as the title and version of the code:\n",
    "\n",
    "\n",
    "def get_document_title(document_list):\n",
    "    titles = []\n",
    "    for doc in document_list:\n",
    "        if \"metadata\" in doc and \"source\" in doc[\"metadata\"]:\n",
    "            m = str(doc[\"metadata\"][\"source\"])\n",
    "            title = re.findall(\"numpy_docs(.*).html\", m)\n",
    "            print(title)\n",
    "            if title:\n",
    "                titles.append(title[0])\n",
    "            else:\n",
    "                titles.append(\"\")\n",
    "        else:\n",
    "            titles.append(\"\")\n",
    "    return titles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PCufm9Xr8eWp"
   },
   "source": [
    "### Pre-processing and loading the documents\n",
    "\n",
    "Next, let's pre-process and load the documents. To make sure we don't need to do this repeatedly while updating code, we're caching it using pickle so it can be retrieved again (this could take a few minutes to run the first time you do it). We'll also add extra metadata to the docs here such as the title and version of the code:\n",
    "\n",
    "*Note*: This step might take up to 10 minutes to run!\n",
    "*Note*: If there is some issue with nltk package, kindly try using\n",
    "```\n",
    "import nltk\n",
    "nltk.download('punkt')\n",
    "```\n",
    "or try to manually install the [nltk_data](https://github.com/nltk/nltk_data/tree/gh-pages) package and unzip the **punkt tokenizer** zip and the **averaged_perceptron_tagger** zip file in the packages folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 443
    },
    "id": "964Z2sZA247g",
    "outputId": "236df468-a630-4691-85a4-886835cfc02d"
   },
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "docs = []\n",
    "docs_path = Path(\"docs.pkl\")\n",
    "for p in tqdm(Path(\"numpy_docs\").rglob(\"*.html\")):\n",
    "    if p.is_dir():\n",
    "        continue\n",
    "    loader = UnstructuredHTMLLoader(str(p))\n",
    "    raw_document = loader.load()\n",
    "    # docs.append(raw_document)\n",
    "    title = get_document_title(raw_document)\n",
    "    m = {\"title\": title}\n",
    "    if raw_document:\n",
    "        raw_document[0].metadata.update(m)\n",
    "        raw_document[0].metadata[\"source\"] = str(raw_document[0].metadata[\"source\"])\n",
    "        docs.extend(raw_document)\n",
    "\n",
    "\n",
    "if docs:\n",
    "    with open(docs_path, \"wb\") as fh:\n",
    "        pickle.dump(docs, fh)\n",
    "else:\n",
    "    with open(docs_path, \"rb\") as fh:\n",
    "        docs = pickle.load(fh)\n",
    "\n",
    "len(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cntAuaUU_TER"
   },
   "source": [
    "### Generating emebeddings from our docs\n",
    "\n",
    "Now that we have our raw documents loaded, we need to pre-process them to generate embeddings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "dHw2DSAj3u9B"
   },
   "outputs": [],
   "source": [
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=1000,\n",
    "    chunk_overlap=200,\n",
    ")\n",
    "documents = text_splitter.split_documents(docs)\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IiM4DJvC_2dV"
   },
   "source": [
    "### Store data in LanceDB Cloud\n",
    "\n",
    "Let's connect to LanceDB so we can store our documents, It requires 0 setup !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "GV77SSi-AK0v"
   },
   "outputs": [],
   "source": [
    "uri = \"db://\" + project_slug\n",
    "table_name = \"langchain_vectorstore\"\n",
    "\n",
    "vectorstore = LanceDB(\n",
    "    embedding=embeddings,\n",
    "    uri=uri,  # your remote database URI\n",
    "    api_key=api_key,\n",
    "    region=\"us-east-1\",\n",
    "    table_name=table_name,  # Optional, defaults to \"vectors\"\n",
    "    mode=\"overwrite\",  # Optional, defaults to \"overwrite\"\n",
    ")\n",
    "\n",
    "doc_ids = vectorstore.add_documents(documents=documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sZOUxfqzXr1m"
   },
   "source": [
    "Now let's create our RetrievalQA chain using the LanceDB vector store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4nDltKClAhhU"
   },
   "outputs": [],
   "source": [
    "qa = RetrievalQA.from_chain_type(\n",
    "    llm=OpenAI(), chain_type=\"stuff\", retriever=vectorstore.as_retriever()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xoS-WKXMXvvR"
   },
   "source": [
    "And thats it! We're all setup. The next step is to run some queries, let's try a few:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7SKSlyq2iwpK"
   },
   "source": [
    "### Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "6aSZr8fCXx9s",
    "outputId": "ac5b5663-d45f-48c0-9f0a-f272e1a3ec2d"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'query': 'tell me about the numpy library?',\n",
       " 'result': ' The NumPy library is an open source Python library that is used for working with numerical data in Python. It contains multidimensional array and matrix data structures, and provides methods for efficient operations on these arrays. It is widely used in various fields of science and engineering and is a core component of the scientific Python and PyData ecosystems. It also offers a large library of high-level mathematical functions for working with arrays and matrices. '}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"tell me about the numpy library?\"\n",
    "qa.invoke(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "EtBw5EH7lv9_",
    "outputId": "1745f881-fa15-44b5-e692-b702babce734"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'query': \"What's the current version of numpy?\",\n",
       " 'result': '\\nThe current version of numpy is 1.16.4.'}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"What's the current version of numpy?\"\n",
    "qa.invoke(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "fR4CmF9ylvzw",
    "outputId": "1b33bb78-4b3f-4dea-addd-75f56eb4e5e6"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'query': 'What kind of linear algebra related operations can be done in numpy?',\n",
       " 'result': ' The numpy package provides various operations related to linear algebra, such as decompositions, matrix eigenvalues, norms, solving equations and inverting matrices, and performing linear algebra on several matrices at once. It also has support for logic functions, masked array operations, mathematical functions, matrix library, miscellaneous routines, padding arrays, polynomials, random sampling, set routines, sorting, searching, counting, statistics, and window functions.'}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"What kind of linear algebra related operations can be done in numpy?\"\n",
    "qa.invoke(query)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
